In [2]:
%pylab inline
%qtconsole
In [3]:
import sklearn
from sklearn import cluster,datasets
In [7]:
N=10**4
D_gen=2
K_gen=10
In [27]:
X,A_true=sklearn.datasets.make_blobs(n_samples=N, n_features=D_gen, centers=K_gen, cluster_std=1.0, center_box=(-100.0, 100.0))
In [12]:
K=10
N_inits=500
max_iter=20
In [17]:
%%timeit
centroids, A_clust, c = sklearn.cluster.k_means(X, K, init='random',
                                                precompute_distances=True, n_init=N_inits,
                                                max_iter=max_iter, random_state=None, copy_x=True,
                                                n_jobs=1)
The documentation claims that precompute_distances (True by default) consumes more memory but makes the algorithm faster. For our goal (running K-Means many times with a small number of iterations) this is not the case, as the timings below show. The documentation does not explain what this parameter actually does under the hood.
In [18]:
%%timeit
centroids, A_clust, c = sklearn.cluster.k_means(X, K, init='random',
                                                precompute_distances=False, n_init=N_inits,
                                                max_iter=max_iter, random_state=None, copy_x=True,
                                                n_jobs=1)
The K-Means function from scikit-learn already supports CPU parallelization through a single parameter. As expected, using multiple cores resulted in a speedup, albeit a small one. The parallelization is applied to the pairwise distance computation: the pairwise distance matrix is split into chunks according to the number of jobs selected, as sketched below.
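The sketch below only illustrates that chunking idea and is not scikit-learn's internal code; the chunked_distances helper, the chunk size, and the use of joblib are assumptions made for the illustration.
# Illustrative sketch: split the point-to-centroid distance computation
# across jobs by chunking the rows of X.
import numpy as np
from joblib import Parallel, delayed
from sklearn.metrics.pairwise import euclidean_distances

def chunked_distances(X, centers, n_jobs=2, chunk_size=2500):
    chunks = [X[i:i + chunk_size] for i in range(0, X.shape[0], chunk_size)]
    blocks = Parallel(n_jobs=n_jobs)(
        delayed(euclidean_distances)(chunk, centers) for chunk in chunks)
    return np.vstack(blocks)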
In [21]:
%%timeit
centroids, A_clust, c = sklearn.cluster.k_means(X, K, init='random',
                                                precompute_distances=True, n_init=N_inits,
                                                max_iter=max_iter, random_state=None, copy_x=True,
                                                n_jobs=-1)
In [24]:
%%timeit
centroids, A_clust, c = sklearn.cluster.k_means(X, K, init='random',
                                                precompute_distances=False, n_init=N_inits,
                                                max_iter=max_iter, random_state=None, copy_x=True,
                                                n_jobs=-1)
It should be noted that this implementation appears to be accelerated by compiled C code.
These are the times from the code above before installing the MKL-optimized scikit-learn stack (from the Anaconda Accelerate add-on).
1 loops, best of 3: 5.67 s per loop
1 loops, best of 3: 4.05 s per loop
1 loops, best of 3: 3.29 s per loop
1 loops, best of 3: 2.57 s per loop
In [2]:
import numbapro
from numbapro import jit, cuda, int32, float32, float64
#@jit(complex64(int32, float32, complex64), target="cpu")
In [3]:
numbapro.check_cuda()
Out[3]:
In [5]:
import math

# CPU version
@numbapro.vectorize(['float32(float32,float32)',
                     'float64(float64,float64)'], target='cpu')
def cpu_sincos(x, y):
    return math.sin(x) * math.cos(y)

# GPU version
@numbapro.vectorize(['float32(float32,float32)',
                     'float64(float64,float64)'], target='gpu')
def gpu_sincos(x, y):
    return math.sin(x) * math.cos(y)
In [6]:
n=10**6
x = np.linspace(0,np.pi,n)
y = np.linspace(0,np.pi,n)
print 'Data of type ',x.dtype
print 'Unoptimized\t',
%timeit np.sin(x) * np.cos(y)
print 'CPU vectorize\t',
%timeit cpu_sincos(x,y)
print 'GPU vectorize\t',
%timeit gpu_sincos(x,y)
del x,y
The unoptimized and CPU-vectorized times are similar because the sin() and cos() calls dominate the runtime. The GPU is already considerably faster and would be faster still on a better GPU (the current GPU has only 48 CUDA cores).
All the data has to travel from RAM to GPU memory and back again. The work is distributed across all the threads on the GPU, and scheduling and memory management are taken care of automatically.
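If those transfers dominate, the operands can be kept on the GPU between calls, using the same to_device / copy_to_host pattern that the guvectorize and @cuda.jit cells below rely on. Whether this version of numbapro lets the vectorized GPU ufunc consume device arrays directly is an assumption here (current numba does), so treat it as a sketch.
# Sketch: pay the host<->device transfer once instead of on every call.
x = np.linspace(0, np.pi, 10**6)
y = np.linspace(0, np.pi, 10**6)
dx = numbapro.cuda.to_device(x)   # host -> device, once
dy = numbapro.cuda.to_device(y)
dres = gpu_sincos(dx, dy)         # assumed to return a device array
res = dres.copy_to_host()         # device -> host, once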
In [39]:
@numbapro.vectorize(['float32(float32,float32,float32,float32)'])
def cpu_powers(p, q, r, s):
    return math.sqrt(p**2 + q**3 + r**4 + s**5)

@numbapro.vectorize(['float32(float32,float32,float32,float32)'], target='gpu')
def gpu_powers(p, q, r, s):
    return math.sqrt(p**2 + q**3 + r**4 + s**5)
When target is not specified, it defaults to 'cpu'.
In [40]:
## Benchmarking
#Generate data
n=5e6
p = np.random.random(n).astype(np.float32)
q = np.random.random(n).astype(np.float32)
r = np.random.random(n).astype(np.float32)
s = np.random.random(n).astype(np.float32)
print 'Unoptimized\t',
%timeit np.sqrt(p**2 + q**3 + r**4 + s**5)
print 'CPU vectorize\t',
%timeit cpu_powers(p,q,r,s)
print 'GPU vectorize\t',
%timeit gpu_powers(p,q,r,s)
del p,q,r,s
Anaconda Accelerate adds MKL optimizations to several libraries. It remains to be checked whether the MKL-optimized NumPy routines are actually faster than the standard ones.
The speedup of the GPU over the CPU is $$2.1\ \mathrm{s} / 69.4\ \mathrm{ms} = 30.26$$
We've now seen that @vectorize accelerates functions, especially on the GPU, but it only works on scalar functions. Even though more complex operations like matrix multiplication can be decomposed into scalar operations over their elements, it is more convenient to operate on arrays directly, which is what @guvectorize allows; a minimal CPU-target sketch follows, and the next cell uses it on the GPU.
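As context for the next cell, here is a minimal sketch of a CPU-target gufunc using the same decorator and layout-signature syntax; vec_add is a made-up example, not code from the video.
# Minimal gufunc sketch: element-wise addition of two 1-D vectors.
# The layout '(n),(n)->(n)' says the inputs and the output share length n;
# any extra leading dimensions of the actual arguments are broadcast over.
@numbapro.guvectorize(['void(float32[:], float32[:], float32[:])'],
                      '(n),(n)->(n)', target='cpu')
def vec_add(a, b, c):
    for i in range(a.shape[0]):
        c[i] = a[i] + b[i]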
In [7]:
@numbapro.guvectorize(['void(float32[:,:], float32[:,:], float32[:,:])'],
                      '(m,n),(n,p)->(m,p)', target='gpu')
def batch_matrix_mult(a, b, c):
    for i in range(c.shape[0]):
        for j in range(c.shape[1]):
            tmp = 0
            for n in range(a.shape[1]):
                tmp += a[i, n] * b[n, j]
            c[i, j] = tmp
#batch_matrix_mult.max_blocksize = 32 # limits to 32 threads per block
This is the code from the video. It looks like a plain matrix multiplication, but in the video they call it a batch multiplication: the layout signature only describes the trailing matrix dimensions, so the gufunc broadcasts over any extra leading (batch) dimension of the inputs and multiplies each pair of matrices in the batch.
In [9]:
n=4e6
dim=2
print 'Memory for both matrices:\t',((2*n*dim*dim*4)/1024),' KBytes'
a=np.random.random(n*dim*dim).astype(np.float32).reshape(n,dim,dim)
b=np.random.random(n*dim*dim).astype(np.float32).reshape(n,dim,dim)
print 'Exception when too many GPU resources are used:'
try:
    batch_matrix_mult(a,b)
except Exception as exception:
    print exception
batch_matrix_mult.max_blocksize = 64 # limit to 64 threads per block
import numpy.core.umath_tests as ut
print 'Unoptimized\t',
#c = np.dot(a,b)
%timeit c = ut.matrix_multiply(a,b)
print 'GUVectorize GPU\t',
%timeit c2 = batch_matrix_mult(a,b)
del n,dim,a,b
In [23]:
del n,dim,a,b
From the documentation:
There are times when the gufunc kernel uses too many of a GPU’s resources, which can cause the kernel launch to fail. The user can explicitly control the maximum size of the thread block by setting the max_blocksize attribute on the compiled gufunc object.
To control the size of the thread block we use (e.g. for 32 threads per block):
very_complex_kernel.max_blocksize = 32
The resulting speedup can be seen from the timings in the next cell.
In [11]:
n=4e6
dim=2
print 'Memory for both matrices:\t',((2*n*dim*dim*4)/1024),' KBytes'
a=np.random.random(n*dim*dim).astype(np.float32).reshape(n,dim,dim)
b=np.random.random(n*dim*dim).astype(np.float32).reshape(n,dim,dim)
dc = numbapro.cuda.device_array_like(a)
da = numbapro.cuda.to_device(a)
db = numbapro.cuda.to_device(b)
def check_pure_compute_time(da, db, dc):
    batch_matrix_mult(da, db, out=dc)
    numbapro.cuda.synchronize()
import numpy.core.umath_tests as ut
print 'Unoptimized\t',
#c = np.dot(a,b)
%timeit c = ut.matrix_multiply(a,b)
print 'Memory management'
batch_matrix_mult.max_blocksize = 64 # limit to 64 threads per block
print 'Automatic:\t\t',
%timeit c=batch_matrix_mult(a,b)
print 'Manual:\t\t\t',
%timeit check_pure_compute_time(da,db,dc)
del n,dim,a,b,da,db,dc
Kernel function
A grid is a group of blocks (which can be 1D, 2D or 3D), and each block is a group of threads (which can also be 1D, 2D or 3D). Every thread executes the same kernel function.
When we use @cuda.jit we are writing functions under the CUDA execution model. The functions we wrote for the GPU earlier had a return value, which is not allowed here; that is because they used @vectorize and did not work directly under the CUDA execution model. A minimal sketch of such a kernel follows.
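As a minimal illustration of that model (scale_by_two is a made-up example): the kernel returns nothing and writes through its array argument, each thread uses its global index to pick one element, and the launch configuration kernel[blocks, threads](...) decides how many threads run.
# Sketch of a minimal CUDA kernel under this execution model.
@numbapro.cuda.jit("void(float32[:])")
def scale_by_two(x):
    i = numbapro.cuda.grid(1)      # this thread's global index in a 1-D grid
    if i < x.shape[0]:             # guard threads that fall past the end of the data
        x[i] = 2.0 * x[i]

vals = np.arange(32, dtype=np.float32)
threads_per_block = 8
blocks_per_grid = (vals.size + threads_per_block - 1) // threads_per_block
scale_by_two[blocks_per_grid, threads_per_block](vals)   # launch: kernel[grid, block](args)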
In [19]:
%qtconsole
In [4]:
import numbapro
from numbapro import *
Let's define a pure Python implementation of the squared Euclidean distance between each row of a and each row of b, stored in c. This is very inefficient.
In [5]:
def dist(a, b, c):
    N, K = c.shape
    dim = a.shape[1]
    for n in xrange(N):
        for k in xrange(K):
            dist = 0
            for d in xrange(dim):
                diff = a[n, d] - b[k, d]
                dist += diff ** 2
            c[n, k] = dist
This version makes more use of the NumPy module, which has compiled code.
In [6]:
def np_dist(a, b, c):
    N, K = c.shape
    dim = a.shape[1]
    for k in range(K):
        dist = a - b[k]
        dist = dist ** 2
        c[:, k] = dist.sum(axis=1)
In [7]:
@numbapro.guvectorize(['void(float32[:,:], float32[:,:], float32[:,:])'],
                      '(m,n),(p,n)->(m,p)', target='gpu')
def guvect_dist(a, b, c):
    N, K = c.shape
    dim = a.shape[1]
    for n in range(N):
        for k in range(K):
            dist = 0
            for d in range(dim):
                diff = a[n, d] - b[k, d]
                #dist += np.power(diff,2)
                dist += diff ** 2
            c[n, k] = dist
This function defines a lower-level implementation of the distance computation. The kernel code is executed by each thread. We want each thread to compute the distance from one point to one cluster, so we'll need $n \times k$ threads to compute the entire distance matrix.
Because each thread corresponds to one entry of that matrix, we have to know which entry it is. Each thread has an index inside the block it belongs to (which can be multidimensional), and each block has an index inside the grid (which can also be multidimensional). The dimensions of the grid and of the blocks don't have to match, so we can have multidimensional blocks inside a one-dimensional grid. This hierarchy is purely logical.
We use the cuda.grid(2) function to retrieve the two-dimensional global index. That index is the one used to index the distance matrix.
In [8]:
@numbapro.cuda.jit("void(float32[:,:], float32[:,:], float32[:,:])")
def cu_dist(a, b, c):
    """
    ## vanilla thread index
    # thread ID inside block
    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    # block ID
    bx = cuda.blockIdx.x
    by = cuda.blockIdx.y
    # block dimensions
    bw = cuda.blockDim.x
    bh = cuda.blockDim.y
    # compute thread's x and y index (i.e. datapoint and cluster)
    k = tx + bx * bw # column for each cluster
    n = ty + by * bh # row for each datapoint
    """
    """
    ## long vertical matrix thread index
    # thread ID inside block
    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    # block ID
    bx = cuda.blockIdx.x
    by = cuda.blockIdx.y
    # block dimensions
    bw = cuda.blockDim.x
    bh = cuda.blockDim.y
    # grid dimensions
    gw = cuda.gridDim.x
    gh = cuda.gridDim.y
    # compute thread's x and y index (i.e. datapoint and cluster)
    k = tx # block width is the same as matrix width
    # the second column of blocks means we want to add
    # 2**16 to the index
    n = ty + by * bh + bx * gh * bh
    """
    k, n = numbapro.cuda.grid(2)
    ch, cw = c.shape  # c height and width
    if n >= ch or k >= cw:
        return
    dist = 0.0
    for d in range(a.shape[1]):
        diff = a[n, d] - b[k, d]
        dist += diff ** 2
    c[n, k] = dist
    #cuda.syncthreads()

def cu_dist_wrap(a, b, c, blockDim, gridDim):
    #stream = cuda.stream()
    dA = cuda.to_device(a)
    dB = cuda.to_device(b)
    #dC = cuda.to_device(c)
    #dC = numbapro.cuda.device_array(c.shape)
    dC = numbapro.cuda.device_array_like(c)
    cu_dist[gridDim, blockDim](dA, dB, dC)
    numbapro.cuda.synchronize()
    #cu_dist[gridDim,blockDim,stream](dA,dB,dC)
    #dC.to_host()
    #stream.synchronize() # block until stream is done
    #result = dC.copy_to_host()
    dC.copy_to_host(ary=c)
    #print 'type of result=', type(result)
    #c = result
    #return result
The maximum number of blocks in each grid dimension is 65535 ($2^{16}-1$). For a big matrix (one that needs more blocks along a dimension than this limit), the blocks have to be spread over both grid dimensions, which is what the grid-size computation below does.
In [14]:
MAX_GRID_XYZ_DIM = 65535
MAX_BLOCK_XY_DIM = 1024
MAX_BLOCK_Z_DIM = 64
#%debug
##generate data
n = 1e4
d = 2
k = 20
total_bytes = (n * d + k * d + n * k) * 4
print 'Memory used by arrays:\t',total_bytes/1024,'\tKBytes'
print '\t\t\t',total_bytes/(1024*1024),'\tMBytes'
print 'Memory used by data: \t',n * d * 4 / 1024,'\t','KBytes'
data = np.random.random((n,d)).astype(np.float32)
clusters = np.random.random((k,d)).astype(np.float32)
#dists = np.empty((n,k)).astype(np.float32)
distsDim = np.int(n),np.int(k)
blockDim = 20,25 # GT520M has 48 CUDA cores
tpg = np.prod(blockDim)
# blocks per grid
bpg = np.ceil( np.float(n * k) / tpg )
gridWidth = np.ceil(bpg / MAX_GRID_XYZ_DIM)
gridHeight = np.ceil(bpg / gridWidth)
gridDim = np.int(gridWidth),np.int(gridHeight)
total_threads = np.prod(gridDim) * np.prod(blockDim)
needed_threads = n*k
print 'Block dimension: \t',blockDim
print 'Grid dimension: \t',gridDim
print 'Threads per block:\t',tpg
print 'Blocks per grid: \t',bpg
print 'Total threads: \t',total_threads
print 'Excess threads: \t', total_threads - needed_threads
if gridDim[0] > MAX_GRID_XYZ_DIM or gridDim[1] > MAX_GRID_XYZ_DIM:
    print "\n* * * * * * * * * * * * * * * * * * * * * * * * * * *\n"
    print "WARNING: Grid dimension too big. Maximum dimension = ", MAX_GRID_XYZ_DIM
    print "\n* * * * * * * * * * * * * * * * * * * * * * * * * * *"
In [50]:
my_dev = cuda.get_current_device()
cuda.api.driver.Device.reset(my_dev)
Out[50]:
In [30]:
%qtconsole
In [43]:
#%debug
#distsPython = np.empty(distsDim, dtype=np.float32)
distsNumPy = np.empty(distsDim, dtype=np.float32)
distsCUDA_auto = np.ones(distsDim, dtype=np.float32)*(-1)
distsCUDA_manual = np.ones(distsDim, dtype=np.float32)*(-1)
distsCUDA_gu = np.empty(distsDim, dtype=np.float32)
#dist(data,clusters,distsPython)
np_dist(data,clusters,distsNumPy)
guvect_dist.max_blocksize = 32 # limits to 32 threads per block
#distsCUDA_gu=guvect_dist(data,clusters)
cu_dist_wrap(data,clusters,distsCUDA_manual,blockDim,gridDim)
cu_dist[gridDim,blockDim](data,clusters,distsCUDA_auto) # direct launch with host arrays ("auto" memory management)
#print 'Python = NumPy\t\t',np.allclose(distsPython,distsNumPy)
print 'NumPy = CUDA manual\t',np.allclose(distsNumPy,distsCUDA_manual)
print 'NumPy = CUDA auto\t',np.allclose(distsNumPy,distsCUDA_auto)
#print 'CUDA auto = manual\t',np.allclose(distsCUDA_auto,distsCUDA_manual)
print 'NumPy = CUDA gu\t',np.allclose(distsNumPy,distsCUDA_gu)
if 'distsPython' in locals(): del distsPython
if 'distsNumPy' in locals(): del distsNumPy
if 'distsCUDA_auto' in locals(): del distsCUDA_auto
if 'distsCUDA_manual' in locals(): del distsCUDA_manual
if 'distsCUDA_gu' in locals(): del distsCUDA_gu
In [29]:
dists = np.empty(distsDim).astype(np.float32)
print 'Pure python:\t\t',
%timeit -r 1 dist(data,clusters,dists)
print 'NumPy optimized:\t',
%timeit np_dist(data,clusters,dists)
#print 'GUVectorize:\t\t',
#%timeit -r 1 guvect_dist(data,clusters,dists)
print 'CUDA auto:\t\t',
%timeit cu_dist[gridDim,blockDim](data,clusters,dists)
print 'CUDA manual:\t\t',
%timeit cu_dist_wrap(data,clusters,dists,blockDim,gridDim)
del dists
Pure python: 1 loops, best of 1: 32min 51s per loop
NumPy optimized: 1 loops, best of 3: 6.93 s per loop
CUDA auto: 1 loops, best of 3: 1.31 s per loop
CUDA manual: 1 loops, best of 3: 1.13 s per loop
In [31]:
speedup=34*60 / 1.13
print speedup
In [32]:
del data,clusters
In [2]:
import numbapro
from numbapro import *
In [15]:
from timeit import default_timer as timer
bpg = 150
tpb = 6
n = bpg * tpb
shared_mem_size = (tpb, tpb)
griddim = bpg, bpg
blockdim = tpb, tpb
# Prepare data on the CPU
#A = np.array(np.random.random((n, n)), dtype=np.float32)
#B = np.array(np.random.random((n, n)), dtype=np.float32)
A = np.ones((n, n), dtype=np.float32)
B = np.ones((n, n), dtype=np.float32)
print "N = %d x %d" % (n, n)
In [3]:
@numbapro.cuda.jit("void(float32[:,:], float32[:,:], float32[:,:])")
def naive_matrix_mult(A, B, C):
    x, y = cuda.grid(2)
    if x >= C.shape[1] or y >= C.shape[0]:
        return
    C[y, x] = 0
    for i in range(A.shape[1]):
        C[y, x] += A[y, i] * B[i, x]

def py_naive_matrix_mult(A, B, C):
    for y in xrange(C.shape[0]):
        for x in xrange(C.shape[1]):
            C[y, x] = 0
            for i in xrange(A.shape[1]):
                C[y, x] += A[y, i] * B[i, x]
In [17]:
# Prepare data on the GPU
dA = cuda.to_device(A)
dB = cuda.to_device(B)
dC = cuda.device_array_like(A)
# Time the unoptimized version
s = timer()
naive_matrix_mult[griddim, blockdim](dA, dB, dC)
numbapro.cuda.synchronize()
e = timer()
unopt_ans = dC.copy_to_host()
tcuda_unopt = e - s
# Time the Python version
py_ans = np.empty(unopt_ans.shape)
s = timer()
#py_naive_matrix_mult(A, B, py_ans)
py_ans=A.dot(B)
numbapro.cuda.synchronize()  # not strictly needed for the CPU dot, kept for symmetry
e = timer()
t_py = e - s
print 'Times'
print 'CUDA \t',tcuda_unopt
print 'Python\t',t_py
print ''
print 'Are the same:\t',np.allclose(unopt_ans,py_ans)
In [5]:
bpg = 150
tpb = 6
n = bpg * tpb
shared_mem_size = (tpb, tpb)
griddim = bpg, bpg
blockdim = tpb, tpb
A = np.ones((n, n), dtype=np.float32)
B = np.ones((n, n), dtype=np.float32)
print "N = %d x %d" % (n, n)
print "Memory:\t",A.nbytes*3 / 1024, '\t','KBytes'
In [6]:
# First operation
res1 = A.dot(B)
dA = cuda.to_device(A)
dB = cuda.to_device(B)
dC = cuda.device_array_like(A)
naive_matrix_mult[griddim, blockdim](dA, dB, dC)
numbapro.cuda.synchronize()
cuda_res1 = dC.copy_to_host()
B += 1
# Second operation
res2 = A.dot(B)
dB = cuda.to_device(B)
dC = cuda.device_array_like(A)
naive_matrix_mult[griddim, blockdim](dA, dB, dC)
numbapro.cuda.synchronize()
cuda_res2 = dC.copy_to_host()
# check results
print 'Result 1 check:','\t',np.allclose(res1,cuda_res1)
print 'Result 2 check:','\t',np.allclose(res2,cuda_res2)